Relative Information Loss in the PCA 



Bernhard C. Geiger*, Gernot Kubin* 
* Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria 

{geiger,gernot.kubin}@tugraz.at 






Abstract - In this work we analyze principal component analysis
(PCA) as a deterministic input-output system. We show 
that the relative information loss induced by reducing the 
dimensionality of the data after performing the PCA is the same 
as in dimensionality reduction without PCA. Furthermore, we 
analyze the case where the PCA uses the sample covariance 
matrix to compute the rotation. If the rotation matrix is not 
available at the output, we show that an infinite amount of 
information is lost. The relative information loss is shown to 
decrease with increasing sample size. 

I. Introduction 

Principal component analysis (PCA) is a powerful tool for 
both linear decorrelation and dimensionality reduction, and is 
thus widely used in machine learning, neural networks, and 
feature extraction [1], [2]. A vast literature proves the 
optimality of PCA in an information-theoretic sense for specific 
cases: Linsker [3] proved that for N-dimensional Gaussian 
data corrupted by Gaussian noise with diagonal covariance 
matrix, the PCA maximizes the mutual information between the 
data and a one-dimensional output random variable. Plumbley 
argued in [4] that for non-Gaussian data the PCA is the 
linear transform which minimizes an upper bound on the 
information lost due to dimensionality reduction. Furthermore, 
the authors of |Q] show that under some circumstances the 
PCA minimizes redundancy in the output data, i.e., is an 
optimal linear independent component analysis. While this list 
of previous works is clearly not complete, it highlights the 
utility of information-theoretic measures to characterize the 
performance of PCA algorithms. 

In the present work, we pursue a slightly different approach: 
We view the PCA as a multivariate, vector-valued input-output 
system and analyze it in terms of information loss. In this con- 
text, information means the total information available at the 
input of the system, in contrast to [3], [4], which considered 
only information which is relevant. As a consequence, while 
our analysis may superficially appear to contradict the results 
available in the literature, it rather provides a different view 
on PCA as an information-processing system. 

In its standard notation, PCA is a linear system 



$$Y = W X \tag{1}$$

where X and Y are the N-dimensional input and output 
vectors, respectively, and W is an orthogonal matrix. In this 
sense, the PCA is a bijective transform and thus invertible. 
Typically, however, the output vector will be truncated after 
performing the PCA in order to reduce the dimensionality of 
the data. In this projection to a lower-dimensional subspace 



information is lost, and we intend to quantify the loss in this 
work. 

Let us be more specific: Assuming continuous-valued ran- 
dom variables (RV), the information available at the input is in- 
finite. In particular, assuming joint continuity, each component 
of the multi-dimensional input RV contains an infinite amount 
of information; thus, if the dimension is reduced, i.e., if some 
components are dropped, an infinite amount of information is 
lost (cf. [5]). 

In case the orthogonal matrix is not known a priori but has 
to be estimated from a set of input data vectors collected in 
the matrix X, the PCA becomes a nonlinear operation: 



$$Y = w(X)\, X. \tag{2}$$



Here, w is a matrix-valued function which computes the 
orthogonal matrix required for rotating the data (e.g., using 
the QR algorithm [6]). If this orthogonal matrix is not stored 
and made available at the output, one will readily agree that 
information is lost even if - at a first glance - the dimension 
of the data is not reduced. 

Our notion of relative information loss captures the fraction of 
input information which cannot be retrieved at the output. We 
make this statement precise in Section II. Before analyzing 
the information loss in PCA in Section V, we present a 
general theorem about the loss in systems which reduce the 
dimensionality of the data (Section III). Section IV acts as 
a bridge, containing some toy examples which should give 
an intuitive understanding of relative information loss. In 
Section VI, the PCA using an input data matrix (cf. (2)) to 
perform the rotation is shown to destroy information even if 
the dimensionality is not reduced. Since none of our findings 
explains the usefulness of the PCA, we finally give an outlook 
on how to corroborate its optimality using a different notion of 
information loss in Section VII. There, we also discuss possible 
implications for a system theory from an information-theoretic 
point of view. 

II. Relative Information Loss - A Quick Overview 

In this section we will briefly present the basic properties 
of relative information loss. To this end, we introduce 

Definition 1 (Relative Information Loss). Let X be an 
N-dimensional RV valued in $\mathcal{X}$, and let Y be obtained by 
transforming X with a static function g, i.e., Y = g(X). We 
define the relative information loss induced by this transform 
as 

$$l(X \to Y) = \lim_{n\to\infty} \frac{H(\hat{X}_n | Y)}{H(\hat{X}_n)} \tag{3}$$

where $\hat{X}_n = \lfloor nX \rfloor / n$ (element-wise). The quantity on the left is 
defined if the limit on the right-hand side exists. 

Loosely speaking, this quantity represents the percentage 
of input information which is lost in the system, with every 
bit weighted equally. Taking, for example, an 8-bit number¹, 
losing either the most significant or the least significant bit 
amounts to a relative information loss of $l(X \to Y) = 1/8$. The 
advantage of this definition is its independence of 
application-specific aspects (where, e.g., the most significant bit may be 
more important than the least significant one; cf. Section VII 
for a discussion). 
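
The 8-bit example can be checked numerically. The following is a minimal sketch, assuming a discrete RV with finite entropy, for which the limit in Definition 1 reduces to H(X|Y)/H(X); the helper names `entropy`, `relative_loss`, `drop_lsb` and `drop_msb` are illustrative and not part of the paper.

```python
from collections import Counter
from math import log2


def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def relative_loss(pmf_x, g):
    """Relative information loss H(X|Y)/H(X) for discrete X and Y = g(X)."""
    pmf_y = Counter()
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p
    h_x = entropy(pmf_x)
    # Y is a function of X, hence H(X|Y) = H(X) - H(Y).
    return (h_x - entropy(pmf_y)) / h_x


# Eight independent fair bits: X is uniform on {0, ..., 255}, H(X) = 8 bit.
uniform_byte = {x: 1 / 256 for x in range(256)}


def drop_lsb(x):
    """Discard the least significant bit."""
    return x >> 1


def drop_msb(x):
    """Discard the most significant bit."""
    return x & 0x7F
```

Both `relative_loss(uniform_byte, drop_lsb)` and `relative_loss(uniform_byte, drop_msb)` evaluate to 1/8, which reflects that the measure weights every bit equally.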

The motivation for introducing this quantity is to 
complement the absolute notion of information loss, given by 
$L(X \to Y) = H(X|Y)$, which suffers from being infinite 
in many practically relevant cases (cf. 0). 

Due to the non-negativity of entropy and the fact that 
conditioning reduces entropy, it follows that $l(X \to Y) \in [0, 1]$. 
Moreover, for continuous input RVs X, a nonzero relative loss 
corresponds to an infinite absolute loss: 

Proposition 1. Let X be such that $H(X) = \infty$ and let 
$l(X \to Y) > 0$. Then, $L(X \to Y) = H(X|Y) = \infty$. 

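Proof: We prove the proposition by contradiction. To this 
end, assume that $H(X|Y) = L < \infty$. Then, 

$$\begin{aligned}
l(X \to Y) &= \lim_{n\to\infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} = \liminf_{n\to\infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} && (4)\\
&\le \liminf_{n\to\infty} \frac{H(X|Y)}{H(\hat{X}_n)} && (5)\\
&= \liminf_{n\to\infty} \frac{L}{H(\hat{X}_n)} = 0 && (6)
\end{aligned}$$
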



where the inequality is due to data processing. The last 
equality follows from the fact that at least a subsequence of 
H(X„) converges to H(X) (cf. Q). ■ 
Furthermore, we can show a tight connection to the 
information dimension introduced by Rényi [8]. We therefore need 
the following 

Definition 2 (Information Dimension [8]). The information 
dimension of an RV X is given as 

$$d(X) = \lim_{n\to\infty} \frac{H(\hat{X}_n)}{\log n} \tag{7}$$

provided the limit on the right exists. 
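
To illustrate Definition 2, the following sketch evaluates the ratio in (7) exactly for a simple mixture: with probability `p_cont` the RV is uniform on [0, 1), otherwise it equals 0. This mixture and the function names are assumptions made for illustration; for such a mixture the information dimension equals the weight of the continuous part, and the ratio approaches it only slowly, at a rate of order 1/log n.

```python
from math import log2


def quantized_entropy(p_cont, n):
    """Entropy in bits of floor(n*X)/n for the assumed mixture.

    With probability p_cont, X is uniform on [0, 1); otherwise X = 0.
    The first quantization cell collects the point mass plus p_cont/n,
    each of the remaining n - 1 cells has probability p_cont/n.
    """
    cell = p_cont / n
    first = (1.0 - p_cont) + cell
    h = 0.0
    if first > 0:
        h -= first * log2(first)
    if cell > 0:
        h -= (n - 1) * cell * log2(cell)
    return h


def dimension_estimate(p_cont, n):
    """Finite-n value of the ratio H(X_n) / log n from Definition 2."""
    return quantized_entropy(p_cont, n) / log2(n)
```

For example, `dimension_estimate(1.0, 2**20)` returns the dimension of a purely continuous scalar RV, while `dimension_estimate(0.5, n)` decreases towards 0.5 as n grows.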



Rényi then showed that the asymptotic behavior of $H(\hat{X}_n)$ 
depends strongly on the information dimension of X: 

Lemma 1 (Asymptotic behavior of $H(\hat{X}_n)$). Let X be an RV 
with existing information dimension d(X) and let $H(\hat{X}_1) < \infty$. 
Then, for $n \to \infty$ the entropy of $\hat{X}_n$ behaves as 

$$H(\hat{X}_n) = d(X) \log n + h + o(1) \tag{8}$$

where h is the d(X)-dimensional entropy of X (provided it 
exists). 

¹ I.e., X is a scalar RV which can be represented by eight independent 
Bernoulli-1/2 RVs; thus H(X) = 8 bit. 

Proof: See [8] (cf. also [9], [10]). ■ 
In particular, if X has a probability measure absolutely 
continuous w.r.t. the N-dimensional Lebesgue measure $\mu^N$ 
($P_X \ll \mu^N$), then the information dimension d(X) = N and 
h denotes the differential entropy of X (see Theorems 1 & 
4 in [8]). We are now ready to make the connection between 
relative information loss and information dimension in 

Theorem 1. Let X be an RV with positive (finite) information 
dimension d(X). Then, if d(X|Y = y) exists and is finite 
$P_Y$-a.s., the relative information loss equals 

$$l(X \to Y) = \frac{d(X|Y)}{d(X)} \tag{9}$$

where $d(X|Y) = \int_{\mathcal{Y}} d(X|Y=y)\, dP_Y(y)$. 

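Proof: We start with Definition 1 and obtain 

$$\begin{aligned}
l(X \to Y) &= \lim_{n\to\infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} && (10)\\
&= \lim_{n\to\infty} \frac{\int_{\mathcal{Y}} H(\hat{X}_n|Y=y)\, dP_Y(y)}{H(\hat{X}_n)} && (11)\\
&= \lim_{n\to\infty} \frac{\int_{\mathcal{Y}} \frac{H(\hat{X}_n|Y=y)}{\log n}\, dP_Y(y)}{\frac{H(\hat{X}_n)}{\log n}} && (12)
\end{aligned}$$

where we divided both the numerator and the denominator 
by log n. Assuming now that the limits of the numerator and 
denominator exist and are finite, we can continue with 

$$\begin{aligned}
l(X \to Y) &= \frac{\lim_{n\to\infty} \int_{\mathcal{Y}} \frac{H(\hat{X}_n|Y=y)}{\log n}\, dP_Y(y)}{\lim_{n\to\infty} \frac{H(\hat{X}_n)}{\log n}} && (13)\\
&= \frac{\lim_{n\to\infty} \int_{\mathcal{Y}} \frac{H(\hat{X}_n|Y=y)}{\log n}\, dP_Y(y)}{d(X)} && (14)
\end{aligned}$$

where we employed Definition 2. By assumption the limit 

$$d(X|Y=y) = \lim_{n\to\infty} \frac{H(\hat{X}_n|Y=y)}{\log n} \tag{15}$$

exists $P_Y$-a.s. Since the information dimension of an RV is 
upper bounded by the topological dimension of its support 
(e.g., by the number of coordinates of the vector X), one can 
apply Lebesgue's dominated convergence theorem (e.g., [11]) 
to exchange the order of the limit and the integral. This 
completes the proof. ■ 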
Before proceeding, we want to mention that the converse 
of Proposition 1 is not true², i.e., that if $H(X) = \infty$ 
and $l(X \to Y) = 0$, it does not necessarily follow that 
$L(X \to Y) < \infty$. To this end, assume that the function g 
is such that, for all y, X|Y = y is a discrete RV with infinite 
entropy but with $H(\hat{X}_1|Y = y) < \infty$. As a consequence, the 
information dimension d(X|Y = y) = 0, which establishes 
$l(X \to Y) = 0$. However, 

$$L(X \to Y) = \int_{\mathcal{Y}} H(X|Y = y)\, dP_Y(y) = \infty. \tag{16}$$

We give an example for such a case in the Appendix. 

² We thank an anonymous reviewer for pointing us to this fact. 

III. Relative Information Loss for Functions Which Reduce Dimensionality 

We now proceed to analyzing the relative information loss 
for measurable functions $g: \mathcal{X} \to \mathcal{Y}$ which map subsets of 
$\mathcal{X} \subseteq \mathbb{R}^N$ to sufficiently well-behaved submanifolds of $\mathbb{R}^N$. In 
particular, let $\{\mathcal{X}_i\}$, i = 1, ..., L, denote a finite partition of $\mathcal{X}$ 
and let $P_X \ll \mu^N$, where $\mu^N$ is the N-dimensional Lebesgue 
measure. Here and throughout the remainder of this paper we 
assume that all involved information dimensions exist and are 
finite. The latter restriction is fulfilled under the mild condition 
that H(X n ) < 00 0, which for scalar (one-dimensional) RVs 
X translates to $E\{|X|^\epsilon\} < \infty$ for some $\epsilon > 0$ [12]. 

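We maintain 

Theorem 2. Let X be such that $P_X \ll \mu^N$ is supported on 
$\mathcal{X} \subseteq \mathbb{R}^N$ and let $\{\mathcal{X}_i\}$ be a partition of $\mathcal{X}$ such that each 
of its L elements is a smooth N-dimensional manifold. Let g 
be such that $g_i = g|_{\mathcal{X}_i}$ are submersions to disjoint smooth 
$n_i$-dimensional manifolds $\mathcal{Y}_i$. Then, the relative information 
loss is 

$$l(X \to Y) = \sum_{i=1}^{L} P_X(\mathcal{X}_i)\, \frac{N - n_i}{N}. \tag{17}$$

Proof: By the submersion theorem (e.g., [13]) and the 
fact that the images of $\mathcal{X}_i$ are disjoint, the preimage of every 
point in $\mathcal{Y}_i$ is a closed submanifold of $\mathcal{X}_i$ with codimension 
$n_i$, i.e., with dimension $N - n_i$. We can thus use Theorem 1 
together with d(X) = N to get 

$$\begin{aligned}
l(X \to Y) &= \frac{1}{N} \int_{\mathcal{Y}} d(X|Y = y)\, dP_Y(y)\\
&= \sum_{i=1}^{L} \int_{\mathcal{Y}_i} \frac{N - n_i}{N}\, dP_Y(y) = \sum_{i=1}^{L} \frac{N - n_i}{N}\, P_Y(\mathcal{Y}_i) && (18)
\end{aligned}$$

which completes the proof by noticing that $P_Y(\mathcal{Y}_i) = P_X(\mathcal{X}_i)$. ■ 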
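
The sum in Theorem 2 is straightforward to evaluate once the partition probabilities and the image dimensions are known. The following is a minimal sketch; the function name `relative_loss_piecewise` and the representation of the partition as a list of pairs are illustrative assumptions, and the code does not check that the pieces are actual submersions.

```python
def relative_loss_piecewise(input_dim, pieces):
    """Evaluate sum_i P_X(X_i) * (N - n_i) / N as in Theorem 2.

    input_dim -- N, the dimension of the (continuously distributed) input
    pieces    -- list of (probability of X_i, dimension n_i of its image)
    """
    pieces = list(pieces)
    total = sum(p for p, _ in pieces)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("partition probabilities must sum to one")
    for _, n_i in pieces:
        if not 0 <= n_i <= input_dim:
            raise ValueError("image dimension must lie between 0 and N")
    return sum(p * (input_dim - n_i) / input_dim for p, n_i in pieces)
```

For instance, `relative_loss_piecewise(3, [(1.0, 2)])` describes a projection from three to two coordinates, and `relative_loss_piecewise(1, [(0.25, 0), (0.75, 1)])` describes a scalar function that is constant on a set of probability 0.25 and invertible elsewhere.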

Before proceeding, two interesting facts are worth mention- 
ing: First of all, as a submersion is a smooth mapping between 
smooth manifolds, we can state 

Corollary 1. If $P_X \ll \mu^N$ and if g is as in Theorem 2, then 

$$l(X \to Y) = 1 - \frac{d(Y)}{d(X)} \tag{19}$$

provided the information dimension of Y exists. 

Proof: We note from the proof of Theorem 2 that 

$$l(X \to Y) = 1 - \sum_{i=1}^{L} \frac{n_i}{N}\, P_Y(\mathcal{Y}_i) \tag{20}$$

where N = d(X). We now show that $d(Y|X \in \mathcal{X}_i) = n_i$. 
By the fact that $g|_{\mathcal{X}_i} \in C^\infty$ is a submersion, the preimages of 
$\mu^{n_i}$-null sets are $\mu^N$-null sets themselves [14]. Consequently, 
since $P_X \ll \mu^N$, the conditional probability measure 
supported on $\mathcal{Y}_i$ is absolutely continuous w.r.t. the $n_i$-dimensional 
Lebesgue measure ($P_{Y|X \in \mathcal{X}_i} \ll \mu^{n_i}$). Since $\mathcal{Y}_i$ is a smooth 
$n_i$-dimensional manifold, we again obtain with Rényi [8] 
that $d(Y|X \in \mathcal{X}_i) = n_i$. The proof is complete with [12, 
Theorem 2] or [15], noting that 

$$d(Y) = \sum_{i=1}^{L} d(Y|X \in \mathcal{X}_i)\, P_X(\mathcal{X}_i) = \sum_{i=1}^{L} n_i\, P_Y(\mathcal{Y}_i). \tag{21}$$


In fact, already Rényi showed (21) for a mixture of a 
one-dimensional continuous RV and a discrete RV in [8]. For 
more complicated measures (e.g., measures with a non-integer 
information dimension) or more general functions it might be 
hard to prove Theorem 2 and Corollary 1. We conjecture, 
however, that the conditions imposed in Theorem 2 can be 
loosened such that at least for a larger class of functions (e.g., 
those for which the images of $\mathcal{X}_i$ are not disjoint) the relative 
information loss can be evaluated. 

Secondly, we believe that one should be able to analyze 
cascaded systems, at least in a restricted class of cases: To 
this end, assume that we have a cascade of two projections: 
one onto the first two coordinates and then one onto the first, i.e., 
$g: [X_1, X_2, X_3] \to [X_1, X_2]$ and $h: [X_1, X_2] \to X_1$. Let the 
RVs $X_1$, $X_2$, and $X_3$ have a continuous joint distribution. 
Then, according to Theorem 2, the function g destroys one 
third of the information, while h destroys half of the remaining 
information. In total, two thirds of the information are lost. 
Indeed, we obtain 

$$l([X_1, X_2, X_3] \to X_1) = \frac{1}{3} + \frac{1}{2} - \frac{1}{3} \cdot \frac{1}{2} = \frac{2}{3}. \tag{22}$$

We are thus led to the following 



Conjecture 1. Let g be a (measurable) function describing a 
system with input X and output Y. Let further h be another 
such system function which transforms input Y to output Z. 
There exist conditions under which 

$$l(X \to Z) = l(X \to Y) + l(Y \to Z) - l(X \to Y)\, l(Y \to Z). \tag{23}$$



Clearly, our example of the cascade of two projections ful- 
fills these conditions. Moreover, as we show in the Appendix, 
this conjecture holds for discrete RVs with finite entropy. 
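
For discrete RVs with finite entropy, the cascade rule (23) can be checked numerically. The sketch below uses an arbitrarily chosen non-uniform input on {0, ..., 7} and two integer-division maps; the distribution, the maps and all function names are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def push_forward(pmf, f):
    """Distribution of f(X) when X has the given pmf."""
    out = Counter()
    for x, p in pmf.items():
        out[f(x)] += p
    return dict(out)


def relative_loss(pmf, f):
    """H(X|f(X)) / H(X) = (H(X) - H(f(X))) / H(X) for discrete X."""
    h_x = entropy(pmf)
    return (h_x - entropy(push_forward(pmf, f))) / h_x


def cascade_check(pmf_x, g, h):
    """Return (direct loss X -> Z, loss predicted by the cascade rule)."""
    pmf_y = push_forward(pmf_x, g)
    l_xy = relative_loss(pmf_x, g)
    l_yz = relative_loss(pmf_y, h)
    l_xz = relative_loss(pmf_x, lambda x: h(g(x)))
    return l_xz, l_xy + l_yz - l_xy * l_yz


# Illustrative input: probabilities proportional to 1, 2, ..., 8.
pmf_x = {x: (x + 1) / 36 for x in range(8)}


def g(x):
    return x // 2  # merges neighbouring pairs, Y in {0, 1, 2, 3}


def h(y):
    return y // 2  # merges pairs again, Z in {0, 1}
```

Calling `cascade_check(pmf_x, g, h)` returns two values that agree up to floating-point rounding.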

Finally, it is worth mentioning that the shape of the dis- 
tribution has no effect on the relative amount of information 
lost. It is essentially this behavior which leads to the somewhat 
counter-intuitive results presented in the following sections. 

IV. Toy Examples for Dimensionality Reduction 

With the help of a few simple examples we now try to 
make the operational meaning of relative information loss 
intuitive. At the same time we highlight its importance to the 
development of an information-centered system theory. 



Fig. 1. The center clipper - another example for dimensionality reduction. 

Let us introduce a two-dimensional RV X, with $P_X \ll \mu^2$ and 
d(X) = 2, and a transform simply adding the two vector 
components, i.e., 

$$Y = g(X) = X_1 + X_2. \tag{24}$$

We can represent this transform by a cascade of an invertible 
linear transform (e.g., $T: X \mapsto [X_1 + X_2, X_1]$) and a simple 
projection onto the first component. Let now $\tilde{X} = [X_1 + X_2, X_1]$. 
Since T is invertible and linear, it is bi-Lipschitz and 
thus preserves the dimension of the transformed RVs. Thus, 

$$l(X \to Y) = l(\tilde{X} \to Y) = 0.5 \tag{25}$$

by Theorem 2. Note that this is another example where 
Conjecture 1 holds. 

With this result and the underlying theorems, the theory of 
information-processing systems can be extended from treating 
cascade structures as in (23) to also considering parallel 
structures whose outputs are added. This would not be possible 
by considering only absolute information loss (e.g., Q), as it 
would be infinite in these cases. 

As a second toy example we consider a center clipper (see 
Fig. 1), which is commonly used for noise suppression or 
residual echo cancellation [16]. We describe the center clipper 
by the following function: 

$$g(x) = \begin{cases} x, & \text{if } |x| > c \\ 0, & \text{otherwise.} \end{cases} \tag{26}$$

Clearly, the domain of this function can be partitioned into 
three elements, on two of which the function is the identity 
function. On the third set, [-c, c], the function is a submersion 
to a zero-dimensional manifold. We can thus apply Theorem 2 
and obtain $l(X \to Y) = P_X([-c, c])$. 
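
As a concrete instance, the following sketch evaluates the relative information loss of the center clipper for a zero-mean Gaussian input. The Gaussian assumption and the function names are illustrative only; the result above holds for any continuous input distribution.

```python
from math import erf, sqrt


def center_clip(x, c):
    """Center clipper: pass x if |x| > c, otherwise output zero."""
    return x if abs(x) > c else 0.0


def clipper_relative_loss_gaussian(c, sigma=1.0):
    """P_X([-c, c]) for X ~ N(0, sigma^2), i.e. the relative loss."""
    return erf(c / (sigma * sqrt(2.0)))
```

With `c` equal to one standard deviation, `clipper_relative_loss_gaussian(1.0)` gives the familiar probability of a Gaussian sample falling within one standard deviation of the mean, so roughly two thirds of the input information are lost in this case.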

It is interesting to see that in both examples the distribution 
of the input signal does not have an influence on the relative 
information loss, as long as it is continuous. This is counter- 
intuitive in the sense that adding two strongly correlated RVs 
should preserve more information than adding two indepen- 
dent RVs, or, in the sense that center clipping a large signal 
should not hurt too much. This intuition, however, is based 
on the fact that one tends to attribute unequal importance 
to different aspects of the information contained in the input 
signal (e.g., principal direction, magnitude, etc.). 



V. PCA with Population Covariance Matrix 

In PCA one uses the eigenvalue decomposition (EVD) of 
the covariance matrix of a multivariate input to obtain a 
different representation of the input vector. In particular, let 
X be an RV with distribution $P_X \ll \mu^N$ and information 
dimension d(X) = N. We further assume that X has zero 
mean and a positive definite population covariance matrix 
$C_X = E\{X X^T\}$ which is known a priori. The case where 
$C_X$ is not known but has to be estimated from the data is 
considered in Section VI. 

The EVD of the covariance matrix yields 

$$C_X = W \Sigma W^T \tag{27}$$

where W is an orthogonal matrix (i.e., $W^{-1} = W^T$) and 
$\Sigma$ is a diagonal matrix consisting of the N eigenvalues of 
$C_X$. We now can describe the PCA by the following linear 
transform: 

$$Y = g(X) = W^T X. \tag{28}$$

As in Section IV, the linear transform is bi-Lipschitz, and the 
information loss vanishes³. 

Often, however, the PCA is used for dimensionality 
reduction, where after the linear transform in (28) the elements of 
the random vector Y with the smallest variances are discarded 
(thus preserving the subspace with the largest variance). 
Essentially, the mapping from Y to, e.g., 

$$Y_M = [Y_1, \dots, Y_M] \tag{29}$$

is a projection onto the first M < N coordinates, which is a 
submersion between two smooth manifolds. We can thus apply 
Theorem 2 and obtain the relative information loss 
$l(Y \to Y_M) = (N - M)/N$. In analogy with the example in Section IV we 
thus get 

$$l(X \to Y_M) = \frac{N - M}{N}. \tag{30}$$

We can now extend this analysis to the case where from 
$Y_M$ an (N-dimensional) estimate $\tilde{X}_M$ of the original data 
X is reconstructed. This estimate is obtained using a linear 
transform 

$$\tilde{X}_M = W I_M^T Y_M \tag{31}$$

where $I_M$ is a rectangular identity matrix with M rows 
and N columns. The (full-rank) matrix $I_M^T$ is a mapping 
to a higher-dimensional space (i.e., from $\mathbb{R}^M$ to $\mathbb{R}^N$) and 
is thus bi-Lipschitz; so is the rotation with the matrix W. 
Furthermore, the transform from $Y_M$ to $\tilde{X}_M$ is invertible and, 
as a consequence, no additional information is lost. We thus 
state 

$$l(X \to \tilde{X}_M) = \frac{N - M}{N} \tag{32}$$

where, using the above notation, $\tilde{X}_M = W I_M^T I_M W^T X$. 
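
The reconstruction step can be sketched for N = 2 and M = 1 using only the standard library. The sketch below makes several assumptions that are not in the text: the covariance matrix is replaced by the second-moment matrix of synthetic correlated Gaussian samples, the principal direction is obtained from the closed-form EVD of a symmetric 2x2 matrix, and all function names are illustrative. It compares the mean-squared reconstruction error after projecting onto the principal direction with the error after simply keeping the first coordinate; both projections lose the same relative amount of information, (N - M)/N = 1/2.

```python
import math
import random


def second_moments(samples):
    """Entries a, b, c of the matrix [[a, c], [c, b]] = (1/n) sum x x^T."""
    n = len(samples)
    a = sum(x1 * x1 for x1, _ in samples) / n
    b = sum(x2 * x2 for _, x2 in samples) / n
    c = sum(x1 * x2 for x1, x2 in samples) / n
    return a, b, c


def principal_direction(a, b, c):
    """Unit eigenvector of [[a, c], [c, b]] for the largest eigenvalue."""
    theta = 0.5 * math.atan2(2.0 * c, a - b)
    return math.cos(theta), math.sin(theta)


def reconstruct(samples, w):
    """Project each sample onto the unit vector w and map it back."""
    out = []
    for x1, x2 in samples:
        y1 = w[0] * x1 + w[1] * x2
        out.append((w[0] * y1, w[1] * y1))
    return out


def mse(samples, estimates):
    """Mean-squared reconstruction error."""
    return sum((x1 - e1) ** 2 + (x2 - e2) ** 2
               for (x1, x2), (e1, e2) in zip(samples, estimates)) / len(samples)


def demo(n=2000, seed=0):
    """Return (MSE with principal direction, MSE keeping coordinate 1)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        samples.append((2.0 * u + 0.5 * v, u - 0.5 * v))  # correlated pair
    w_pca = principal_direction(*second_moments(samples))
    mse_pca = mse(samples, reconstruct(samples, w_pca))
    mse_coord = mse(samples, reconstruct(samples, (1.0, 0.0)))
    return mse_pca, mse_coord
```

In `demo()`, the first returned error is never larger than the second, although the relative information loss is identical for both projections.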

Indeed, the same result would have been obtained if the 
rotation had been performed using any other orthogonal 
matrix, regardless of which elements of the rotated 
vector were discarded. In particular, if just the first M 
components of X had been preserved, we would also have 
$l(X \to X_M) = (N - M)/N$. 

Of course, PCA is known to be optimal in the sense that, by 
discarding the elements of Y with the smallest variances, the 
mean-squared reconstruction error for $\tilde{X}_M$ is minimized [1]. 
For this interpretation, measuring the information loss with 
respect to a relevant random variable may do the trick, 
providing us with a statement about the optimality of PCA 
in an information-theoretic sense (cf. [3], [4]). Conversely, 
if one cannot determine which information at the input is 
relevant, one has no reason to perform PCA prior to reducing 
the dimension of the data. 

³ Not only the relative information loss $l(X \to Y)$ vanishes, but also the 
absolute information loss: $L(X \to Y) = 0$. 

Fig. 2. The PCA as a nonlinear input-output system. "Cov" denotes the 
computation of the sample covariance matrix and "EVD" stands for eigenvalue 
decomposition. 

VI. PCA with Sample Covariance Matrix 

We argued in Section V that the PCA without 
dimensionality reduction is an invertible transform. This, however, is only 
true if one has access to the orthogonal matrix W; if not, 
i.e., if one feeds a system with the data matrix X and just 
receives the rotated matrix Y (see Fig. 2), it can be shown 
that information is lost. In this section we make this statement 
precise. 
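
The loss of the rotation can be made tangible with a small experiment for N = 2. In the sketch below (illustrative names, synthetic Gaussian data, and the closed-form EVD of a symmetric 2x2 matrix are all assumptions), a data matrix and an arbitrarily rotated copy of it are both passed through PCA with the sample covariance matrix. Since the eigenvectors are only determined up to sign, the outputs are compared row by row up to a sign flip.

```python
import math
import random


def pca_rotate(samples):
    """Return W^T x for each sample, with W from the EVD of (1/n) X X^T."""
    n = len(samples)
    a = sum(x1 * x1 for x1, _ in samples) / n
    b = sum(x2 * x2 for _, x2 in samples) / n
    c = sum(x1 * x2 for x1, x2 in samples) / n
    # Closed-form eigenvectors of the symmetric matrix [[a, c], [c, b]].
    theta = 0.5 * math.atan2(2.0 * c, a - b)
    cs, sn = math.cos(theta), math.sin(theta)
    return [(cs * x1 + sn * x2, -sn * x1 + cs * x2) for x1, x2 in samples]


def rotate(samples, phi):
    """Rotate every sample by the angle phi."""
    cs, sn = math.cos(phi), math.sin(phi)
    return [(cs * x1 - sn * x2, sn * x1 + cs * x2) for x1, x2 in samples]


def agree_up_to_sign(y_a, y_b, tol=1e-9):
    """True if each output coordinate matches up to a common sign flip."""
    for k in range(2):
        same = all(abs(p[k] - q[k]) <= tol for p, q in zip(y_a, y_b))
        flipped = all(abs(p[k] + q[k]) <= tol for p, q in zip(y_a, y_b))
        if not (same or flipped):
            return False
    return True


def demo(phi=0.7, n=50, seed=1):
    """Check that X and its rotated copy lead to the same PCA output."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        samples.append((2.0 * u + 0.5 * v, u - 0.5 * v))
    rotated = rotate(samples, phi)
    return agree_up_to_sign(pca_rotate(samples), pca_rotate(rotated))
```

`demo()` returns `True` for any rotation angle: two different inputs are mapped to the same output, so the rotation cannot be recovered from Y alone.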

We believe that the following analysis can be generalized 
to all data matrices with a continuous joint distribution, i.e., 
$P_X \ll \mu^{nN}$. For the sake of simplicity, in this paper, we will 
focus on a particularly simple scenario: Let X denote a matrix 
where each of its n columns represents an independent sample 
of an N-dimensional Gaussian RV X. Again, let X have 
zero mean and positive definite population covariance matrix 
$C_X$. As a consequence, the probability distribution of the data 
matrix X is absolutely continuous w.r.t. the nN-dimensional 
Lebesgue measure ($P_X \ll \mu^{nN}$). 

The sample covariance matrix $\hat{C}_X = \frac{1}{n} X X^T$ is symmetric 
and almost surely positive definite. In the usual case where 
n > N one can show that N(N+1)/2 entries can be chosen 
and that the remaining entries depend on these in a 
deterministic manner. Indeed, since in this case the distribution of 
$\hat{C}_X$ possesses a density (the Wishart distribution, cf. [17]), 
the distribution is absolutely continuous w.r.t. the Lebesgue 
measure on an N(N+1)/2-dimensional submanifold of the 
$N^2$-dimensional Euclidean space. With some abuse of notation we 
thus write $P_{\hat{C}_X} \ll \mu^{N(N+1)/2}$. 



The orthogonal matrix W for PCA (see (28); now applied 
to the matrix X instead of the vector X) is obtained from the 
EVD of the sample covariance matrix, i.e., 

$$\hat{C}_X = W \hat{\Sigma} W^T \tag{33}$$

where $\hat{\Sigma}$ is the diagonal matrix containing the eigenvalues 
of $\hat{C}_X$. The joint distribution of the N eigenvalues of $\hat{C}_X$ 
possesses a density [17, Ch. 9.4]; thus, the distribution of $\hat{\Sigma}$ 
is absolutely continuous w.r.t. the Lebesgue measure on an 
N-dimensional submanifold of the $N^2$-dimensional Euclidean 
space, or $P_{\hat{\Sigma}} \ll \mu^N$. 

Clearly, the entries of $\hat{C}_X$ are smooth functions of the 
eigenvalues and the entries of W. Images of Lebesgue null-sets 
under smooth functions between Euclidean spaces of the 
same dimension are null-sets themselves; were the probability 
measure $P_W$ supported on some set of dimensionality lower 
than N(N-1)/2, the image of the product of this set and $\mathbb{R}^N$ (for 
the eigenvalues) would be a Lebesgue null-set with positive 
probability measure. Since this contradicts the fact that $\hat{C}_X$ is 
continuously distributed, it follows that 

$$P_W \ll \mu^{N(N-1)/2}. \tag{34}$$



We now argue (see Appendix for a rigorous discussion) that 
the rotated data does not tell us anything about the rotation, 
hence 

$$P_{W|Y=y} \ll \mu^{N(N-1)/2}. \tag{35}$$

Knowing Y = y, X is a linear, bi-Lipschitz function of 
W, and thus the information dimension remains unchanged. 
Therefore we get with Theorem 1 

$$l(X \to Y) = \frac{d(X|Y)}{d(X)} = \frac{N(N-1)}{2nN} = \frac{N-1}{2n}. \tag{36}$$



We now drop a little of the mathematical rigor to analyze, 
for the sake of completeness, the less common case where 
there are fewer data samples than there are dimensions for each 
sample (n < N). In this case, the sample covariance matrix 
is not full rank, which means that the EVD yields N - n 
vanishing eigenvalues. Assuming that still 
$P_{\hat{C}_X} \ll \mu^{n(2N-n+1)/2}$, 
one finds along the same lines as in the case n > N that the 
loss evaluates to 

$$l(X \to Y) = \frac{2N - n - 1}{2N}. \tag{37}$$

The behavior of the relative information loss as a function of 
n is shown in Fig. 3 for different choices of N. A worked 
example for a particularly simple case can be found in the 
Appendix. 
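
The two expressions (36) and (37) can be combined into one small helper, which is a sketch under the assumptions stated above (Gaussian data matrix, and the less rigorous argument for n < N). The function name is illustrative. The two formulas coincide at n = N, which is why the boundary case is assigned to the first branch here.

```python
from fractions import Fraction


def pca_relative_loss(N, n):
    """Relative information loss of PCA with sample covariance matrix.

    N -- dimension of each sample, n -- number of samples.
    Uses (N - 1) / (2n) for n >= N and (2N - n - 1) / (2N) for n < N.
    """
    if N < 1 or n < 1:
        raise ValueError("N and n must be positive integers")
    if n >= N:
        return Fraction(N - 1, 2 * n)
    return Fraction(2 * N - n - 1, 2 * N)
```

For example, `pca_relative_loss(5, 1)` gives 4/5, `pca_relative_loss(5, 5)` gives 2/5, and `pca_relative_loss(10, 20)` gives 9/40, so the loss decreases as more samples become available.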

The relative information loss induced by PCA results from 
the fact that one cannot know which rotation led to the 
output data matrix. As a consequence, the relative information 
loss decreases with a larger number of samples: the total 
information increases while the uncertainty about the rotation 
remains the same. Note further that the relative information 
loss cannot exceed $(N-1)/N$ (for n = 1), which is due to the fact 
that the rotation preserves the norm of the sample. 

Fig. 3. Relative information loss in the PCA with sample covariance matrix as 
a function of the number n of independent measurements. The cases N = 5 
(black), N = 10 (red), and N = 20 (blue) are shown. The dashed lines 
indicate the conjectured loss for singular sample covariance matrices. 

The PCA with sample covariance matrix also allows dif- 
ferent interpretations of information loss, in addition to the 
information lost in the rotation: First, as we show in the 
Appendix, by the fact that the sample covariance matrix of Y 
is a diagonal matrix, the possible values of Y are restricted to 
a submanifold of dimensionality smaller than nN. Naturally, 
this also restricts the amount of information which can be 
conveyed in the output. In contrast, in PCA using the 
population covariance matrix the sample covariance matrix of 
Y will (almost surely) not contain zeros, so the entries of 
Y will not be restricted deterministically. 

Finally, it is interesting to observe that in this case the 
absolute information loss in the PCA is infinite (see Proposition 1), 
even if no additional dimensionality reduction is performed. 
Moreover, this analysis not only holds for the PCA, but for any 
rotation which depends on the input data in a similar manner 
- in this sense, the PCA is not better than any other rotation. 

VII. Discussion 

Relative information loss was introduced to cope with the 
shortcomings of absolute information loss, especially in cases 
where the information loss $L(X \to Y)$ and the information 
transfer I(X;Y) are infinite. However, as the examples of 
Sections IV and V showed, even relative information loss does not 
necessarily provide full insight and intuitive interpretations in 
some cases. 

These cases may be characterized by the fact that not all 
information at the input of the system is relevant for a given 
application, i.e., we are actually not interested in X itself but 
in some random variable Z somehow related to it. Z, however, 
is not accessible directly, but only through a function g of the 
related RV X. The logical consequence is thus to define a 
quantity which captures the information loss relevant w.r.t. Z. 
We did so in [18], where we analyzed some of this quantity's 
properties and made a connection to the signal enhancement 
problem. 

A very similar quantity has already been introduced by 
Plumbley [4] in the context of unsupervised learning. Using 
this quantity he proved the optimality of the PCA under 
appropriate constraints, and argued that information loss in 
some cases is more versatile than mutual information. 

In order to be able to minimize this relevant information 
loss, one clearly has to know something about the relevant 



RV Z and its relationship to the system input X. If one 
does not have this knowledge, all bits of information have 
to be treated equally, leading to our notion of (relative) 
information loss. As a direct consequence, while in some 
cases the PCA prior to dimensionality reduction might be the 
optimal solution (cf. [4], [18]), without knowledge about the 
relevant information one has no reason to apply it to a data 
set. Even more so, as we showed in Section VI, even without 
reducing the dimensionality, information can be lost, which 
should prevent one from unjustified use of PCA. 

We now discuss a different aspect of our work about 
information loss - be it absolute or relative - in deterministic 
systems. The definition of these quantities allows us to quan- 
tify and, hopefully, understand the propagation of information 
in a network of systems. 

Take, for example, the PCA: The information at the input 
is split and propagates through the system, with parts of it 
being lost in some paths and preserved in others. By denoting 
$t(X \to \cdot) = 1 - l(X \to \cdot)$ the relative information transfer, 
we can thus redraw the system model and obtain Fig. 4, where 
the arrows are labeled according to the relative information 
transfer along them. In particular, the eigenvalue decomposi- 
tion splits the information contained in the covariance matrix 
into a part describing the eigenvectors and a part describing 
the eigenvalues. The former part is lost in PCA, while the 
second is preserved in the output. However, as we already 
mentioned before, the rotation (i.e., the multiplication with 
the orthogonal matrix) removes exactly as much information 
from the input as is contained in the orthogonal matrix. The 
information contained in both Y and W suffices to reconstruct 
X. If we perform in addition a sphering transform on Y (thus 
making all eigenvalues unity), one needs the triple Y, W, and 
$\hat{\Sigma}$ to reconstruct X. 

It is obvious that this transfer graph must not be understood 
in the sense of a preservation theorem, similar to Kirchhoff's 
current law: Information can be split and fused, and after 
splitting the sum of information at the output need not equal 
the sum of information at the input (as it does, by coincidence, 
for the EVD). If the output information is less than the input 
information, the intermediate system destroyed the remaining 
information (as in rotation). Conversely, if the sum of 
output information exceeds the input information, this only 
means that parts of this information must be the same (as, 
e.g., Y includes the knowledge about $\hat{\Sigma}$). 

VIII. Conclusion 

We have introduced the notion of relative information loss 
to analyze systems for which the output has a lower dimen- 
sion than the input, exploiting a tight connection to Renyi's 
information dimension. As a first result, we showed that the 
relative information loss for dimensionality reduction is not 
affected by performing a principal component analysis (PCA) 
beforehand. 

We then showed that even without dimensionality reduction 
the PCA is not information lossless, given that the sample 
covariance matrix is used to compute the rotation matrix. 
There, the relative information loss decreases with 
increasing sample size. 

Fig. 4. Information propagation in PCA with sample covariance matrix. The values on the arrows indicate the relative information transfer $t(X \to \cdot)$. We 
assume n > N in this case. The gray part considers the sphering transform (see text). Note that the separate information transfers to $\hat{\Sigma}$, W, and Y add up 
to $1 + 1/n > 1$. This is because the information in $\hat{\Sigma}$ is already contained in Y. Note further that $t(X \to Y, \hat{\Sigma}, W) = 1$, as expected. 

We proposed to use these somewhat counter-intuitive results 
as motivation to introduce a notion of information loss which 
takes the relevance of the data into account, and thus connects 
to results available in the literature. A detailed analysis of this 
relevant information loss is within the scope of future work. 

Appendix A 
Example for Infinite Absolute and Vanishing Relative Information Loss 

We now give an example where an infinite amount of 
information is lost (i.e., $L(X \to Y) = \infty$), but for which the 
relative information loss nevertheless vanishes (i.e., 
$l(X \to Y) = 0$). To this end, consider the following 
scalar function $g: (0, 1] \to (0, 1]$: 

$$g(x) = 2^n (x - 2^{-n}) \quad \text{if } x \in (2^{-n}, 2^{-n+1}],\ n \in \mathbb{N}. \tag{38}$$

In other words, this function maps every interval $(2^{-n}, 2^{-n+1}]$ 
onto the interval (0, 1]. Assume now further that the input 
variable X has a continuous distribution with density function 

$$f_X(x) = 2^n \left( \frac{1}{\log(n+1)} - \frac{1}{\log(n+2)} \right) \quad \text{if } x \in (2^{-n}, 2^{-n+1}],\ n \in \mathbb{N} \tag{39}$$

where log denotes the binary logarithm. As an immediate 
consequence, the output RV Y is uniformly distributed on 
(0, 1]. Furthermore, the information dimension of the input 
is given by d(X) = 1. 



Since the function is piecewise strictly monotone we can
apply the reasoning of [5] and claim that

$$L(X \to Y) = H(X|Y) = H(W|Y) \tag{40}$$

where

$$W = n \quad \text{if } X \in (2^{-n}, 2^{-n+1}]. \tag{41}$$

In particular, for the given input density function we obtain

$$\Pr(W = n \mid Y = y) = \Pr(W = n) = \frac{1}{\log(n+1)} - \frac{1}{\log(n+2)}. \tag{42}$$

For this distribution, however, it is known that the entropy is
infinite, i.e., $H(W) = \infty$ [19]. This shows that in this case
the absolute information loss is infinite as well.

However, for every $y \in (0, 1]$ the preimage is a countable
set, thus $X|Y = y$ is a discrete RV. In addition,
since $X$ is supported on a compact set, so is $X|Y = y$, and
every quantization $X_n|Y = y$ will have finite entropy. Thus,
$d(X|Y = y) = 0$ for all $y$ [8], and with TheoremQ] we obtain
$l(X \to Y) = 0$.
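
The two claims of this example can be checked numerically. The following Python sketch evaluates the interval probabilities $\Pr(W = n) = 1/\log(n+1) - 1/\log(n+2)$ with the binary logarithm and prints the partial probability mass and the partial entropy sums. The function names and the truncation points `n_max` are illustrative choices and not part of the original text. The sketch only illustrates the slow growth of the partial entropy sums; it does not prove divergence.

```python
import numpy as np

def interval_probs(n_max):
    """p_n = 1/log2(n+1) - 1/log2(n+2) for n = 1, ..., n_max."""
    n = np.arange(1, n_max + 1, dtype=float)
    return 1.0 / np.log2(n + 1.0) - 1.0 / np.log2(n + 2.0)

def partial_mass_and_entropy(n_max):
    """Partial sums of the probability mass and of the entropy (in bit)."""
    p = interval_probs(n_max)
    mass = p.sum()                      # telescopes to 1 - 1/log2(n_max + 2)
    entropy = -(p * np.log2(p)).sum()   # every term is positive
    return mass, entropy

for n_max in (10**2, 10**4, 10**6):
    mass, entropy = partial_mass_and_entropy(n_max)
    print(f"n_max={n_max:>8d}  mass={mass:.4f}  partial entropy={entropy:.4f} bit")
```

The mass approaches 1 only logarithmically, and the partial entropy keeps increasing with `n_max`. Because $|g'(x)| = 2^n$ on the $n$-th interval, the same sum of the $p_n$ is also the value of the output density at every $y$, which is consistent with $Y$ being uniform on $(0, 1]$.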

Appendix B 
Proof of ConjectureQ]for discrete RVs 

Let $X$, $Y$, and $Z$ be discrete RVs with finite entropy, and
let further $Y = g(X)$ and $Z = h(Y)$. As an immediate result,
$I(X; Y) = H(Y)$ and $I(Y; Z) = H(Z)$. We next note that

$$l(X \to Z) = \frac{H(X|Z)}{H(X)} \tag{43}$$

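since the RVs are already discrete. Furthermore, we can write

$$l(X \to Z) = \frac{H(X) - I(X; Z)}{H(X)} = 1 - t(X \to Z) \tag{44}$$

where $t(X \to Z) = \frac{H(Z)}{H(X)}$, since $Z$ is also a function
of $X$. Expanding the latter term, the relative information
transfer, one obtains

$$\begin{aligned}
1 - l(X \to Z) = t(X \to Z) &= \frac{H(Z)}{H(X)} = \frac{H(Z)}{H(Y)} \frac{H(Y)}{H(X)} \\
&= \frac{I(Y; Z)}{H(Y)} \frac{I(X; Y)}{H(X)} = t(X \to Y)\, t(Y \to Z) \\
&= (1 - l(X \to Y))(1 - l(Y \to Z)) \\
&= 1 - l(X \to Y) - l(Y \to Z) + l(X \to Y)\, l(Y \to Z).
\end{aligned} \tag{45}$$

Rearranging yields $l(X \to Z) = l(X \to Y) + l(Y \to Z) - l(X \to Y)\, l(Y \to Z)$, which completes the proof. ■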
We note in passing that this result holds for the cascade 
of deterministic systems and that a generalization to Markov 
chains X — Y — Z is not possible. 
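
The composition rule for discrete RVs can be verified numerically. The following Python sketch draws a random probability mass function for $X$ and applies two deterministic maps. The alphabet sizes, the maps `g` and `h`, and all function names are illustrative assumptions. It uses $l(A \to B) = 1 - H(B)/H(A)$ for $B = f(A)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bit of a probability mass function."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def push_forward(p, mapping, n_out):
    """pmf of f(A) when A has pmf p and f(a) = mapping[a]."""
    q = np.zeros(n_out)
    np.add.at(q, mapping, p)   # accumulate the mass of all preimages
    return q

def relative_loss(p_in, p_out):
    """l(A -> B) = H(A|B)/H(A) = 1 - H(B)/H(A) for B = f(A)."""
    return 1.0 - entropy(p_out) / entropy(p_in)

rng = np.random.default_rng(0)
p_x = rng.dirichlet(np.ones(12))   # random pmf on 12 symbols
g = np.arange(12) // 2             # Y = g(X): merges pairs, 12 -> 6 symbols
h = np.arange(6) // 3              # Z = h(Y): merges triples, 6 -> 2 symbols

p_y = push_forward(p_x, g, 6)
p_z = push_forward(p_y, h, 2)

l_xy = relative_loss(p_x, p_y)
l_yz = relative_loss(p_y, p_z)
l_xz = relative_loss(p_x, p_z)
print(l_xz, l_xy + l_yz - l_xy * l_yz)   # the two values should agree
```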

Appendix C 
Information dimension of X|Y = y 

We already observed that $P_W \ll \mu^{N(N-1)/2}$. Furthermore,
since we know that the sample covariance matrix of $Y$ is a
diagonal matrix, the corresponding equations

$$\forall\, 1 \le i < j \le N: \quad (C_Y)_{ij} = \frac{1}{n} \sum_{k=1}^{n} Y_{ik} Y_{jk} = 0 \tag{46}$$

restrict the possible values of $Y$ from an $nN$-dimensional to
an $M$-dimensional subspace with

$$M = nN - \frac{N(N-1)}{2}. \tag{47}$$



In fact, it can be shown that $M$ elements of $Y$ are random
while the remaining $(nN - M)$ depend on these in a
deterministic manner.$^4$
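
The diagonality constraints on the sample covariance matrix of $Y$ can be illustrated numerically. The following Python sketch assumes zero-mean data, a sample covariance matrix normalized by $1/n$, and a rotation matrix whose columns are the eigenvectors returned by NumPy's `eigh`. The variable names and the values of `N` and `n` are illustrative and not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 3, 10                       # data dimension and sample size (n > N)
X = rng.standard_normal((N, n))    # data matrix, one sample per column

C_x = X @ X.T / n                  # sample covariance (zero mean assumed)
_, W = np.linalg.eigh(C_x)         # columns of W are orthonormal eigenvectors
Y = W.T @ X                        # rotated data
C_y = Y @ Y.T / n                  # sample covariance of the rotated data

# The N(N-1)/2 off-diagonal entries of C_y vanish up to rounding errors.
off_diag = C_y - np.diag(np.diag(C_y))
num_constraints = N * (N - 1) // 2
M = n * N - num_constraints        # remaining degrees of freedom in Y

print(np.max(np.abs(off_diag)), num_constraints, M)
```

Since `W` is orthogonal, `W @ Y` recovers `X`, so the data can be reconstructed only if the rotation matrix is available together with `Y`.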

Since $X = WY$, $X$ is a smooth function from $\mathbb{R}^{N(N-1)/2} \times
\mathbb{R}^{M}$ (the ranges of the values of $W$ and $Y$) to $\mathbb{R}^{nN}$ (the range
of values of $X$), thus from $\mathbb{R}^{nN}$ to $\mathbb{R}^{nN}$. Since this smooth
mapping maps null-sets to null-sets [13, Lem. 6.2], we obtain

$$P_{(W,Y)} \ll \mu^{nN}. \tag{48}$$



(We are well aware that $W$ and $Y$ together have more than
$nN$ entries, but only $nN$ of those can be chosen freely. In
other words, the graph of the functions defining the remaining
entries of $Y$ and $W$ is an $nN$-dimensional submanifold of
$\mathbb{R}^{N(n+N)}$ [13, Lem. 5.9].) The joint distribution of $(W, Y)$
thus possesses a density, and by marginalizing and conditioning
so does $W|Y = y$. As a consequence,

$$P_{W|Y=y} \ll \mu^{N(N-1)/2}. \tag{49}$$



Note further that this does not mean that W is independent 
of Y - it just means that these two quantities are at least not 
related deterministically. 

The final step is taken by recognizing that if one knows $Y = y$,
then $X|Y = y$ is a linear function of $W|Y = y$. Since
$Y$ has full rank, the linear function maps the $N^2$-dimensional
space of $W$ (on which the probability mass is concentrated
on an $\frac{N(N-1)}{2}$-dimensional subspace) to the $N^2$-dimensional
linear subspace of $\mathbb{R}^{nN}$. With [20, Remark 28.9] this transform
is bi-Lipschitz and preserves the information dimension. Thus,

$$d(X|Y = y) = \frac{N(N-1)}{2}. \tag{50}$$

$^4$ E.g., one could determine $Y_{ij}$ from the equation for $(C_Y)_{ij}$.
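
To see how this dimension relates to the sample size, the following Python sketch computes the ratio of $N(N-1)/2$ to $nN$. It rests on two assumptions that are not stated in this appendix: that the relative information loss equals the ratio of the conditional information dimension to the information dimension of the input, and that the input has information dimension $nN$. The function name is hypothetical.

```python
from fractions import Fraction

def relative_loss_sample_pca(N, n):
    """Ratio (N(N-1)/2) / (nN), assuming d(X) = nN for the input data."""
    return Fraction(N * (N - 1), 2) / (n * N)

# The ratio simplifies to (N-1)/(2n) and shrinks as the sample size n grows.
for n in (5, 10, 100):
    print(n, relative_loss_sample_pca(3, n))
```

Under these assumptions, `relative_loss_sample_pca(2, 1)` evaluates to 1/2, which agrees with the worked example for $N = 2$ and $n = 1$ in Appendix D.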



Appendix D 

Example: PCA with singular Sample Covariance 
Matrix 

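We now give a worked example for the (admittedly less
common) case of a singular sample covariance matrix. Let
$X = [X_1, X_2]^T$ be a two-dimensional Gaussian RV, and let $n = 1$, i.e.,
the data matrix consists of the single sample $X$. The sample covariance matrix is given by

$$C_X = \begin{bmatrix} X_1^2 & X_1 X_2 \\ X_1 X_2 & X_2^2 \end{bmatrix} \tag{51}$$

and has eigenvalues $|X|^2$ and $0$. The corresponding (normalized)
eigenvectors are then given by

$$p_1 = \frac{1}{|X|} \begin{bmatrix} \operatorname{sgn}(X_2) X_1 \\ |X_2| \end{bmatrix} \tag{52}$$

and

$$p_2 = \frac{1}{|X|} \begin{bmatrix} -\operatorname{sgn}(X_1) X_2 \\ |X_1| \end{bmatrix}. \tag{53}$$

Performing now the rotation $Y = W^T X$ with $W = [p_1, p_2]$
one obtains

$$Y = [\operatorname{sgn}(X_2) |X|, 0]^T. \tag{54}$$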

The fact that the second component of $Y$ is zero regardless of
the entries of $X$ makes it obvious that exactly one half of the
information is lost, i.e., $l(X \to Y) = \frac{1}{2}$. This also corresponds
to the result obtained in Section VI for $N = 2$ and $n = 1$.
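
This example can be checked numerically for a single Gaussian sample. The following Python sketch assumes the eigenvectors $p_1 = [\operatorname{sgn}(X_2) X_1, |X_2|]^T / |X|$ and $p_2 = [-\operatorname{sgn}(X_1) X_2, |X_1|]^T / |X|$ and the rotation $Y = W^T X$. The sign convention for $p_2$ is an assumption made for this sketch, since any unit vector orthogonal to $X$ is a valid eigenvector for the eigenvalue 0.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(2)         # one two-dimensional Gaussian sample (n = 1)
x1, x2 = x
norm = np.linalg.norm(x)

C = np.outer(x, x)                 # sample covariance matrix for n = 1

# Normalized eigenvectors (sign convention for p2 is an assumption).
p1 = np.array([np.sign(x2) * x1, abs(x2)]) / norm
p2 = np.array([-np.sign(x1) * x2, abs(x1)]) / norm
W = np.column_stack([p1, p2])

y = W.T @ x                        # rotated sample
print(y, np.sign(x2) * norm)       # second entry of y is zero up to rounding
```

The eigenvalue equations `C @ p1 = norm**2 * p1` and `C @ p2 = 0` hold for this choice, and the second component of `y` carries no information about `x`.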